function prediction
Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training
Luo, Xiaoling, Chen, Peng, Liu, Chengliang, Jin, Xiaopeng, Wen, Jie, Liu, Yumeng, Wang, Junsong
Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.46)
A Novel Framework for Multi-Modal Protein Representation Learning
Zheng, Runjie, Wang, Zhen, Qiao, Anjie, Xie, Jiancong, Rao, Jiahua, Yang, Yuedong
Accurate protein function prediction requires integrating heterogeneous intrinsic signals (e.g., sequence and structure) with noisy extrinsic contexts (e.g., protein-protein interactions and GO term annotations). However, two key challenges hinder effective fusion: (i) cross-modal distributional mismatch among embeddings produced by pre-trained intrinsic encoders, and (ii) noisy relational graphs of extrinsic data that degrade GNN-based information aggregation. We propose Diffused and Aligned Multi-modal Protein Embedding (DAMPE), a unified framework that addresses these through two core mechanisms. First, we propose Optimal Transport (OT)-based representation alignment that establishes correspondence between intrinsic embedding spaces of different modalities, effectively mitigating cross-modal heterogeneity. Second, we develop a Conditional Graph Generation (CGG)-based information fusion method, where a condition encoder fuses the aligned intrinsic embeddings to provide informative cues for graph reconstruction. Meanwhile, our theoretical analysis implies that the CGG objective drives this condition encoder to absorb graph-aware knowledge into its produced protein representations. Empirically, DAMPE outperforms or matches state-of-the-art methods such as DPFunc on standard GO benchmarks, achieving AUPR gains of 0.002-0.013 pp and Fmax gains 0.004-0.007 pp. Ablation studies further show that OT-based alignment contributes 0.043-0.064 pp AUPR, while CGG-based fusion adds 0.005-0.111 pp Fmax. Overall, DAMPE offers a scalable and theoretically grounded approach for robust multi-modal protein representation learning, substantially enhancing protein function prediction.
- North America > United States (0.14)
- Asia > China > Guangdong Province > Guangzhou (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > Canada (0.04)
- Europe > United Kingdom > Wales (0.04)
- (2 more...)
AnnoDPO: Protein Functional Annotation Learning with Direct Preference Optimization
Deciphering protein function remains a fundamental challenge in protein representation learning. The task presents significant difficulties for protein language models (PLMs) due to the sheer volume of functional annotation categories and the highly imbalanced distribution of annotated instances across biological ontologies. Inspired by the remarkable success of reinforcement learning from human feedback (RLHF) in large language model (LLM) alignment, we propose AnnoDPO, a novel multi-modal framework for protein function prediction that leverages Direct Preference Optimization (DPO) to enhance annotation learning. Our methodology addresses the dual challenges of annotation scarcity and category imbalance through preference-aligned training objectives, establishing a new paradigm for biological knowledge integration in protein representation learning.
- North America > Canada (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
Protein Large Language Models: A Comprehensive Survey
Xiao, Yijia, Zhao, Wanjia, Zhang, Junkai, Jin, Yiqiao, Zhang, Han, Ren, Zhicheng, Sun, Renliang, Wang, Haixin, Wan, Guancheng, Lu, Pan, Luo, Xiao, Zhang, Yu, Zou, James, Sun, Yizhou, Wang, Wei
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Texas (0.04)
- Asia > Middle East > Yemen > Amanat Al Asimah > Sanaa (0.04)
Transformers trained on proteins can learn to attend to Euclidean distance
Ellmen, Isaac, Schneider, Constantin, Raybould, Matthew I. J., Deane, Charlotte M.
While conventional Transformers generally operate on sequence data, they can be used in conjunction with structure models, typically SE(3)-invariant or equivariant graph neural networks (GNNs), for 3D applications such as protein structure modelling. These hybrids typically involve either (1) preprocessing/tokenizing structural features as input for Transformers or (2) taking Transformer embeddings and processing them within a structural representation. However, there is evidence that Transformers can learn to process structural information on their own, such as the AlphaFold3 structural diffusion model. In this work we show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We first provide a theoretical explanation for how Transformers can learn to filter attention as a 3D Gaussian with learned variance. We then validate this theory using both simulated 3D points and in the context of masked token prediction for proteins. Finally, we show that pre-training protein Transformer encoders with structure improves performance on a downstream task, yielding better performance than custom structural models. Together, this work provides a basis for using standard Transformers as hybrid structure-language models.
GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction
Miao, Yuwei, Guo, Yuzhi, Ma, Hehuan, Yan, Jingquan, Jiang, Feng, Liao, Rui, Huang, Junzhou
Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structures or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring Gene Ontology graph and annotation with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes the function prediction to gene and gene products. Specifically, two pre-train tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. Specified masking and recovering task helps GoBERT in finding implicit patterns among functions. The pre-trained GoBERT possess the ability to predict novel functions for various gene and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
- Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.62)
Training on test proteins improves fitness, structure, and function prediction
Bushuiev, Anton, Bushuiev, Roman, Zadorozhny, Nikola, Samusevich, Raman, Stärk, Hannes, Sedlar, Jiri, Pluskal, Tomáš, Sivic, Josef
Data scarcity and distribution shifts often hinder the ability of machine learning models to generalize when applied to proteins and other biological data. Self-supervised pre-training on large datasets is a common method to enhance generalization. However, striving to perform well on all possible proteins can limit model's capacity to excel on any specific one, even though practitioners are often most interested in accurate predictions for the individual protein they study. To address this limitation, we propose an orthogonal approach to achieve generalization. Building on the prevalence of self-supervised pre-training, we introduce a method for self-supervised fine-tuning at test time, allowing models to adapt to the test protein of interest on the fly and without requiring any additional data. We study our test-time training (TTT) method through the lens of perplexity minimization and show that it consistently enhances generalization across different models, their scales, and datasets. Notably, our method leads to new state-of-the-art results on the standard benchmark for protein fitness prediction, improves protein structure prediction for challenging targets, and enhances function prediction accuracy.
- Europe > Austria > Vienna (0.14)
- North America > United States > Massachusetts (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
A Trilogy of AI Safety Frameworks: Paths from Facts and Knowledge Gaps to Reliable Predictions and New Knowledge
A Trilogy of AI Safety Frameworks: Paths from Facts and Knowledge Gaps to Reliable Predictions and New Knowledge Simon Kasif Department of Biomedical Engineering Program in Bioinformatics Department of Computer Science Boston University AI Safety has become a vital front-line concern of many scientists within and outside the AI community. There are many immediate and long term anticipated risks that range from existential risk to human existence to deep fakes and bias in machine learning systems [1-5]. In this paper, we reduce the full scope and immense complexity of AI safety concerns to a trilogy of three important but tractable opportunities for advances that have the short-term potential to improve AI safety and reliability without reducing AI innovation in critical domains. In this perspective, we discuss this vision based on several case studies that already produced proofs of concept in critical ML applications in biomedical science. Linking Predictions to Knowledge (P2K) The remarkable acceleration of AI technologies is effectively driving a flood of AI based methods and predictions. We are literally drowning in a sea of predictions and AI generated products and this trend is projected to increase.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.68)
Protein-Mamba: Biological Mamba Models for Protein Function Prediction
Xu, Bohao, Lu, Yingzhou, Inoue, Yoshitaka, Lee, Namkyeong, Fu, Tianfan, Chen, Jintai
Protein function prediction is a pivotal task in drug discovery, significantly impacting the development of effective and safe therapeutics. Traditional machine learning models often struggle with the complexity and variability inherent in predicting protein functions, necessitating more sophisticated approaches. In this work, we introduce Protein-Mamba, a novel two-stage model that leverages both self-supervised learning and fine-tuning to improve protein function prediction. The pre-training stage allows the model to capture general chemical structures and relationships from large, unlabeled datasets, while the fine-tuning stage refines these insights using specific labeled datasets, resulting in superior prediction performance. Our extensive experiments demonstrate that Protein-Mamba achieves competitive performance, compared with a couple of state-of-the-art methods across a range of protein function datasets. This model's ability to effectively utilize both unlabeled and labeled data highlights the potential of self-supervised learning in advancing protein function prediction and offers a promising direction for future research in drug discovery.
- North America > United States > Minnesota (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)